The expansion of a (CAG/CTG)n triplet repeat has been found to be
associated with at least seven genetic diseases, suggesting that this
mechanism of disease may be fairly common. To accelerate the discovery
of new loci containing (CAG/CTG)n triplet expansions, we have isolated
numerous genomic clones containing this class of repeats. We have
developed 338 sequence-tagged sites (STSs) containing (CAG/CTG)n repeat
sequences. Two hundred ninety-nine STSs were unambiguously assigned to
chromosomes, and 89 of the total were assigned to YACs. The 141 STSs
that were developed based on (CAG/CTG)n repeats of at least seven units
were genotyped on four reference CEPH individuals to estimate their
polymorphic quality. © 1996 Academic Press, Inc.



INTRODUCTION 

Expansions of trinucleotide repeats have been found to be associated
with at least nine human diseases, including Fragile X syndrome, FRAXE
mental retardation, myotonic dystrophy, Kennedy syndrome, Huntington's
disease, spinocerebellar ataxia type 1, dentatorubral-pallidoluysian
atrophy, Haw River disease, and Machado-Joseph disease (Verkerk et al.,
1991; La Spada et al., 1991; Brook et al., 1992; Mahadevan et al., 1992;
Fu et al., 1992; The Huntington's Disease Collaborative Research Group,
1993; Orr et al., 1993; Knight et al., 1993; Koide et al., 1994;
Nagafuchi et al., 1994; Burke et al., 1994; Kawaguchi et al., 1994).
Each of these diseases, as well as two additional fragile sites [FRAXF
(Nancarrow et al., 1994) and FRA16A (Parrish et al., 1994)], is
associated with the expansion of a (CGG)n or (CAG/CTG)n repeat.

Present address: Millennium Pharmaceuticals Incorporated, Cambridge,
MA 02139.

To whom correspondence should be addressed at Center for Genome
Research, Whitehead Institute/MIT, 1 Kendall Square Building 300,
Room 525, Cambridge, MA 02139. Telephone: (617) 252-1912;
Fax: (617) 252-1902.

In an attempt to facilitate the discovery of other disease states
associated with trinucleotide repeat expansions, we have initiated a
global approach to cloning trinucleotide repeats from genomic DNA. We
have developed STSs based on all 10 classes of trinucleotide repeats
generated from marker-enriched small insert genomic libraries. The
results of a survey of the 10 classes of trinucleotide repeats for their
usefulness as new genetic markers have been presented elsewhere (Gastier
et al., 1995). Here, we present the results of our efforts to generate
large numbers of STSs based on (CAG/CTG)n repeats, since this class has
been found to be associated with the majority of the diseases listed
above (all but the fragile sites). These STSs are a valuable screening
set in the search for disease mutations suspected to be caused by a
trinucleotide repeat expansion, such as other neurodegenerative diseases
and diseases showing evidence of anticipation.

MATERIALS AND METHODS 

Hybridization conditions. P1 clones (DuPont-Merck) and marker-enriched
small insert clones [prepared as described previously (Pulido and Duyk,
1994)] were picked into 96-well plates and replicated onto MAGNA-Nylon
membranes (Micron Separations, Inc.). Cosmid clones (Stratagene) were
lifted directly off plates. After being grown and fixed on the
membranes, clones were screened using the Quick-Light hybridization
system (FMC Corp.). Hybridization and washes were performed at 55°C, and
control clones were used on all primary and secondary screenings to
increase the proportion of picked positives that had a repeat length of
at least five units.

Development of sequence-tagged sites. Hybridization-positive clones
were subjected to single-pass cycle sequencing using the M13 (-21)
and/or SP6 dye primer kits (Applied Biosystems, Inc.) with the ABI 373
automated sequencer. Template DNA was prepared using the Magic Minipreps
kit (Promega Corp.). Duplicate clones were detected using Sequencher
(GeneCodes Corp.). Primers flanking the repeat were chosen using the
Primer program (MIT/Whitehead Institute) as implemented by the CHLC
(Cooperative Human Linkage Center) primer pipeline. All primers were
selected to have a Tm close to 60°C. For information on the CHLC
pipeline server, send a blank e-mail message to primer-server@chlc.org.
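
The repeat-length criterion applied before primer design (at least five perfect units) can be illustrated with a short sketch. The function names, the example sequences, and the regular-expression approach are assumptions made for illustration only; they are not taken from the Primer program or the CHLC pipeline.

```python
import re

# The three reading frames of (CAG)n plus those of its reverse
# complement (CTG)n. This motif list is an illustrative assumption.
MOTIFS = ("CAG", "AGC", "GCA", "CTG", "TGC", "GCT")


def longest_perfect_repeat(seq, motifs=MOTIFS):
    """Return (motif, units, start) for the longest perfect tandem run, or None."""
    seq = seq.upper()
    best = None
    for motif in motifs:
        # Each match is a maximal uninterrupted run of one motif.
        for m in re.finditer(f"(?:{motif})+", seq):
            units = (m.end() - m.start()) // 3
            if best is None or units > best[1]:
                best = (motif, units, m.start())
    return best


def passes_length_cutoff(seq, min_units=5):
    """True if the sequence contains a perfect run of at least min_units."""
    hit = longest_perfect_repeat(seq)
    return hit is not None and hit[1] >= min_units


# Hypothetical clone sequences, not real data.
long_clone = "TTGA" + "CAG" * 7 + "TTAT"
short_clone = "ACGT" + "CTG" * 4 + "ACGT"
print(longest_perfect_repeat(long_clone))
print(passes_length_cutoff(short_clone))
```

A clone such as `short_clone` would fall into the "Repeat length <5" category under this sketch, whereas `long_clone` would go forward to primer selection.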

Polymorphism analysis. Primers were genotyped on four reference CEPH
individuals (1331 01, 1331 02, 1408 01, and 1408 02) under standard PCR
conditions. Two children of 1331 01 and 1331 02 (1331 03 and 1331 04)
were also typed to confirm Mendelian inheritance of the alleles.
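
As a minimal sketch of the Mendelian check described above, a child's genotype can be tested against the two parental genotypes. The representation of a genotype as a pair of allele sizes in base pairs, the function name, and the example sizes are all hypothetical.

```python
from itertools import product


def mendelian_consistent(father, mother, child):
    """True if the child's two alleles can be explained by one allele from each parent."""
    target = sorted(child)
    return any(sorted((f, m)) == target for f, m in product(father, mother))


# Hypothetical allele sizes (bp) for a parent pair and two children.
father = (150, 153)
mother = (150, 156)
print(mendelian_consistent(father, mother, (153, 156)))  # consistent
print(mendelian_consistent(father, mother, (159, 150)))  # allele 159 is unexplained
```

An inconsistent child genotype in such a check would point to a typing artifact or a new mutation rather than a stable, heritable allele.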

Chromosome assignment and YAC screening. Tentative localization of
each STS to a specific human chromosome was accomplished by PCR-based
screening of the National Institute of General Medical Sciences somatic
cell hybrid mapping panels 1 and 2. These STSs are being included in the
YAC mapping effort at the Whitehead Institute/MIT Center for Genome
Research (Bell et al., 1995). The STS content data are freely available
via the web server at http://www.genome.wi.mit.edu.

BLAST searches. Various "masking" procedures were applied to the query
sequences prior to database homology searching. First, the (CAG/CTG)n
sites themselves (along with other regions of low compositional
complexity such as homopolymeric tracts) were identified and masked
using the NSEG program (Wootton and Federhen, in press). Masking
consists of replacing individual nucleotides with the character "X,"
which is ignored by the BLAST family of programs (Schuler et al., 1995).
These modified query sequences were then screened against a database of
vector sequences using the BLASTN program with the E parameter set to
1e-06. All regions of the query sequence matching vector sequences were
then masked using the xblast utility (Claverie and States, 1993). These
doubly masked query sequences were then searched against the
nonredundant nucleotide sequence database using BLASTN with E at 1e-06
as before. The minimum BLAST score that was accepted for homology was
1e-37. Query sequences containing Alu repetitive elements were
identified by matches to "ALU WARNING" entries in GenBank (Schuler et
al., 1995) and by analyzing the alignment outputs. All BLAST searches
were carried out using the BLAST network server and databases at NCBI
(Schuler et al., 1995).

Electronic access to data. CHLC maps and marker information are
available through several electronic information sources: anonymous ftp
(ftp.chlc.org), Gopher (gopher.chlc.org), and World Wide Web (WWW)
(http://www.chlc.org). Table 2 is available at the CHLC WWW site or
through the Whitehead/MIT Center for Genome Research WWW site
(http://www-genome.wi.mit.edu). All sequences have been submitted to GDB
and GenBank.

RESULTS 

Generation of (CAG/CTG)n-Based STSs

We estimate that there are approximately 1500 loci in the human genome
with at least five perfect (CAG/CTG)n units, based on screening a human
genomic cosmid library. In addition, we have screened 800 random human
genomic P1 clones to confirm this number. In this screen, 4% of the
clones were positive. Assuming a 90-kb insert size for the P1 clones and
3000 Mb as the size of the genome, we calculated that there are
approximately 1300 (CAG/CTG)n-containing loci in the genome.

TABLE 1
Recovery of (CAG/CTG)n STSs

                          Number of clones
Primers designed           375 (33.0%)
No suitable primers         24 (2.1%)
Misplaced repeat           209 (18.4%)
Repeat length <5           125 (11.0%)
Poor sequence               67 (5.9%)
Duplicates                 338 (29.7%)
Total sequenced           1138 (100%)

FIG. 1. Distribution of (CAG/CTG)n repeat lengths in 479 unique clones.
[Histogram; x-axis: number of repeats, 5 to 23.]
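
The P1-based estimate follows from simple proportionality, sketched below. The function name is hypothetical, and the calculation assumes that each positive clone carries exactly one repeat locus and that the clones sample the genome uniformly.

```python
def estimate_repeat_loci(fraction_positive, insert_kb, genome_mb):
    """Estimate genome-wide repeat loci from the fraction of positive clones.

    Assumes one locus per positive clone and uniform genome coverage.
    """
    clones_per_genome = genome_mb * 1000 / insert_kb
    return fraction_positive * clones_per_genome


# Values stated in the text: 4% positives, 90-kb inserts, 3000-Mb genome.
loci = estimate_repeat_loci(0.04, 90, 3000)
spacing_kb = 3000 * 1000 / loci

print(f"estimated loci: {loci:.0f}")
print(f"average spacing: {spacing_kb:.0f} kb")
```

Under these assumptions the screen gives roughly 1300 loci, or one locus per 2250 kb on average, which is in line with the cosmid-based figure of about 1500.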

Since the (CAG/CTG)n class of short tandem repeats is infrequent,
occurring on average every 2000-2500 kb, we have utilized small insert
marker-enriched libraries for screening. Hybridization conditions were
optimized to yield repeat lengths of 5 or greater based on the discovery
that the most common allele of the myotonic dystrophy locus has a repeat
length of 5 (Imbert et al., 1993). We have screened a total of
approximately 34,000 marker-enriched colonies and are attempting to
sequence 2400 positive clones so that primers flanking the repeats can
be designed. This paper describes the set of STSs that was developed
based on the first 1138 positive clones. The losses encountered in this
process are shown in Table 1. It is interesting that this class of
repeats is rarely associated with Alu repetitive elements, since it is
believed that microsatellites may have derived from Alu elements
(Beckmann and Weber, 1992). Of the 24 failures due to no suitable
primers, 11 were due to an Alu element flanking the repeat. This means
that only 1% of the original clones failed to have primers designed due
to an Alu, as opposed to other classes of trinucleotide repeats, such as
(ATA)n and (AAC)n, in which 30-40% of the clones failed due to an Alu
element (Gastier et al., 1995).
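
The percentages in Table 1 and the 1% Alu figure can be re-derived from the clone counts. The sketch below uses the counts as printed; the variable and function names are illustrative.

```python
# Clone counts transcribed from Table 1.
table1 = {
    "Primers designed": 375,
    "No suitable primers": 24,
    "Misplaced repeat": 209,
    "Repeat length <5": 125,
    "Poor sequence": 67,
    "Duplicates": 338,
}
total_sequenced = sum(table1.values())


def percent(count, total=total_sequenced):
    """Percentage of all sequenced clones, rounded to one decimal place."""
    return round(100 * count / total, 1)


# Of the 24 clones with no suitable primers, 11 had a flanking Alu element.
alu_failures = 11

for label, count in table1.items():
    print(f"{label:22s}{count:5d} ({percent(count)}%)")
print(f"Alu-related primer failures: {percent(alu_failures)}% of sequenced clones")
```

The categories sum to the 1138 clones sequenced, and 11 of 1138 is just under 1%.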

Polymorphic Character of the (CAG/CTG)n STSs

We were able to identify a (CAG/CTG)n repeat in 479 unique sequences
obtained in this survey; the distribution of the length of the repeat in
these clones is shown in Fig. 1. The majority of the repeats that were
sequenced were short in length. In a previous survey of the 10 classes
of trinucleotide repeats, we showed that only repeats of eight or more
units tended to be polymorphic (Gastier et al., 1995). This suggested to
us



(CAG/CTG)n SCREENING SET



TABLE 2

STSs Developed for (CAG/CTG)n Repeats

CHLC name (a)         Repeat      Alleles in   YAC       BLAST homology
                      sequenced   4 CEPHs (b)  data (c)

Chromosome 1
GCT1C9                15          1
GCT1E07               7                        *         [illegible]
GCT3G04               6                                  none
GCT3H01               7
GCT4A10               5           1            [illegible] [accession illegible]-calcium channel
GCT4B11                           1            *
GCT4[illegible]                   [illegible]            none
GCT6B08               6                                  none
[illegible]           6                                  none
[illegible]           7                                  Td6519, T8443a cDNA clones
[illegible]           [illegible]                        none
[illegible]           7                                  none
[illegible]           8           1            *         none
[illegible]           10          2                      none
[illegible]           6                                  none
[illegible]                       1                      none
[illegible]           11          2                      none
GCT11C08              6                                  none
GCT11G06              6           1                      L17904-Human STS UT1852
GCT11H0G              [illegible]                        none
GCT12H03              [illegible]                        none
GCT13A06                          [illegible]            none
GCT14t08              [illegible]                        none
GCT14G04              [illegible]                        none
[illegible]           [illegible]                        none
[illegible]           10          1            *         none
GCT15G02              7           1            *
GCT15G08                          1            *         none
[illegible]                                              none
[illegible]                                              none
[illegible]                                              none
GCT17[illegible]      [illegible]                        none
GCT17[illegible]      [illegible] 7            *         none

Chromosome 2
GCT1B4                5                                  none
GCT3A11               6
[illegible]           [illegible]                        none
GCT3D12               [illegible]                        none
GCT4A02               5                                  none
GCT4A03               5                                  none
GCT4D02               5
GCT5A0Q                                                  none
GCT5A09               [illegible]                        none
GCT5C07               [illegible] 2                      none
[illegible]           [illegible]                        none
[illegible]           [illegible]                        none
GCT6D03               6                                  none
GCT6H03               7                                  none
GCT7C03               7
GCT8B09 (D2S1397)     10                                 none
GCT9C02               7                                  none
GCT10B07              5
GCT10007              5
GCT10F01              7                                  none
GCT10F03              [illegible]                        X76570-simple DNA seq. wg1a5
GCT11B12                                                 none
GCT11G10              9                                  X76582-simple DNA seq. wg1e10
GCT15B08              [illegible]                        none
GCT15D07              7
GCT15F06              6                                  none
GCT15G07              6                                  none
GCT16H01              7                                  none

Chromosome 3
GCT1A10               16                                 L10376, T07007-CTG-B33 mRNA, EST HFBEC27
GCT1B01               6                                  none
GCT1B06               6                                  none
GCT1D06               6                                  none
GCT1D8                9                                  none
GCT2A09               6                                  none
GCT2C10               5                                  none
GCT3B12               7           2                      none
GCT3C11 (D3S2399)     5                                  none
GCT4B10 (D3S2400)     16          2                      none
GCT5E11 (D3S2401)     23          7                      none
GCT6G12               6                                  none
GCT8B03               7           2                      none
GCT8C05               11          2            *         none
GCTeEn                7           2                      none
GCT10A08              6                                  none
GCT16O09              6                                  none
GCT12G01              [illegible] 1                      none
GCT14C07              7           1            *         M87731-human simple repeat polymorphism
GCT14D10              7           1                      T28895, M97287-EST 59645, SATB1 mRNA
GCT15G11              6                                  none
GCT16A08              6                                  none
GCT17B01              9           1                      none
GCT17B07              7           2                      none

Chromosome 4
GCT1B3                7           1                      none
GCT2C0e               5           1                      none
GCT4S02               6                                  none
GCT6B12               5                                  none
GCT6F01               5                                  none
GCT6F03 (D4S2430)     12          4                      none
GCT7C02               6                                  none
GCT7G03               7           2                      none
GCT8D06 (D17S1292)    10          1                      L17909-STS UT1861
GCT10e09              5           1                      none
GCT12H11              5                                  none
GCT13F01              6                                  none
GCT14E02              7           2            *         none
GCT15O08              [illegible] 1            *         none
GCT160O4              9           3            *         none
GCTieC02              5                        *         none
GCT17D01              5                        *         none

Chromosome 5
GCT1A01               6                                  none
GCT5A12               6                                  none
GCT5E05 (D5S1472)     8           2            *         none
GCT6A09               8           1            *         none
GCt6ei2               6                                  none
GCT7e01               5                                  none
GCT7F10               5                                  none
GCT10A04              9           1                      none
GCT10E0e              7           1                      none
GCT10F04              10          2                      none
GCT10G10              8           1            *         none
GCT11E05              7           1                      none
GCT11G01              6                                  none
GCT11H05              6                                  none
GCT15A05              6                                  none
GCTiseoe              5                                  none
GCT15E04              5                                  none

Chromosome 6
GCT4A11               5                                  none
GCT4B05 (D6S1014)     11          4                      none
GCT5A01               6                                  none
GCT5A02               8           1            *         none
GCT5E07 (D6S1015)     11          2            *         none
GCT6G02 (D6S1059)     8           3            *         none
QCrSGOS               5                                  none
GCT10C05              6                                  none
GCT11E01              5                                  none
GCTiaeos              8           1                      Lie099-STS UT2607, X73969-wg1f4 repeat region
GCT12Bia              5                                  none
GCT12G04              5                                  none
GCT16B08              6           2                      none
GCT16D06              7           2                      none
GCT16F02              8           1                      none

Chromosome 7
GCT9C05               7           2                      none
GCT10E08              13          7                      none
GCT10F0e              5                                  none
GCT13H07              7           2            *         none
GCT14A05              9           1                      L17905-STS UT1853
GCT14B10              5                                  none
GCT15G01              6                                  none
GCT16H03              5                                  none



Chromosome 8
GCT4E02                                                  none
GCT5C04                                                  none
GCT5H01                                                  none
GCT6G07                                                  none
GCT7F01                                                  none
GCT9A01                                                  none
GCT10D12                                                 none
GCTioeoi                                                 none
GCT10F09                                                 L17749-STS UT1022
GCT10H04                                                 none
GCT13F07                                                 none
GCT15A02                                                 none
GCT17F04                                                 none

Chromosome 9
GCT3G05               9                                  none
GCT8A01               7                                  none
GCT8B04               5                                  none
GCT8G09               6                                  none
GCT11F04              [illegible]                        none
GCTiaeoa              6                                  none
GCT14H05              6                                  none
GCT16D0e              6                                  none
GCT16E0e              14                                 none
GCT16G03              8                                  none
GCT17e09              11                                 none
GCT17C06              5                                  none

Chromosome 10
ACT3E01               11                                 none
GCT1C06               6                                  none
GCT3A04               8                                  none
GCT3E03               6                                  none
GCT3F05               5                                  none
GCT8C03               8                                  none
GCT8C11               5                                  none

Chromosome 11
GCT5G08                                                  none
GCT7a08                                                  none
GCT8e07                                                  none
GCT10A01                                                 none
GCT13C12                                                 Z15459-partial cDNA clone 20B07
GCT13D04                                                 none
GCT14C12                                                 X14972, X53773-mouse, rat alpha-adaptin
GCT14E11                                                 none
GCT16A03                                                 none
GCT16B07                                                 none
GCT16B10                                                 none
GCTieR)7                                                 none
GCT16G07                                                 none
GCT17D11                                                 none

Chromosome 12
GCT1A11               6           2                      none
GCT1C5                8           1                      none
GCT5A05               5           1                      J04182-lamp-1 mRNA
GCT5D05               5                                  none
GCT6E07 (D12S1072)    12          3                      none
GCT8B07               9           1                      none
GCT8G12               9                                  none
GCT9C01               7           1                      none
GCT12G10              6                                  none

Chromosome 13
ACT3F12               6                                  none
GCT1C11               5                                  none
GCT4G06               6                                  none
GCT7B03               19                                 R18580-cDNA clone 30262
GCT7B05               7                                  none
GCT7F02               6                                  none
GCT13E04              5                                  none
GCT16A11              6                                  none
GCT16C05              6                                  none
GCT16F03              6                                  none
GCT17F01              5                                  none

Chromosome 14
GCT2B12               6                                  none
GCT2C07               5                                  none
GCT6H01               7           1            *         none
GCT7H01               7           1                      none
GCT8B05               7           1                      none
GCT8D03               7           1                      none
GCTSeiO               6                                  none
GCT11B02              5                                  none
GCT13E12              6                                  none
GCT14E06              8           1                      none
GCT15609              5                                  none
GCT16B06              6                                  none

Chromosome 15
GCT1C8                8           2                      none
GCT2C03               6                                  none
GCT3B06               6                                  none
GCT4G01               8           1                      none
GCT5C12               5                                  none
GCT6B06               11          1                      X76569-human simple DNA seq clone wg1a4
GCTeF04               6                                  none
GCT7C09               9           2            *         L17911-STS UT1869
GCT10E11              5                                  none
GCT11A04              8           1                      none
GCT11A05              5                                  none
GCT12B11              7           1                      none
GCT13E05              5                                  none
GCT13e07              6                                  none
GCT13F05              5           1                      L35568, S70721-Islet 2, Islet 1 mRNA (many species)
GCT14H07              7           3            *         none

Chromosome 16
GCT2005               8           1                      none
GCT3B03               5                                  none
GCT3B05               8           1            *         none
GCT3B11               8           2                      none
GCT7F04               5                                  none
GCT7F11               6                                  none
GCT10B02              8           1                      none
GCT10D03              7           1                      none
GCT10E09              6                                  none
GCT13F06              5                                  none
GCT13F09              7           1            *         none
GCT14B11              7           1            *         none
GCT15A12              6                                  none
GCT15C04              8           1                      none
GCT15D10              9           1                      none
GCT16F05              6                                  none
GCT16F08              6                        *         L26339-human autoantigen mRNA

Chromosome 17
GCT1E1 (D17S1291)     6                        *         none
GCT6E11               10          2            *         none
GCT7A04               5                                  none
GCT7011               6                                  F182424, R31127 cDNA clones
GCT10C02              6           1                      none
GCT10O04              13          2                      029801-mouse mRNA ORF
GCT13F02              5                                  none
GCT14B05              5                                  none
GCT15E02              6                                  none
GCT16D12              8           1                      none
GCT17C04              7           1                      none
GCT17F07              6                                  none

Chromosome 18
GCT3A09               6           1                      none
GCT3E06 (D18S880)     7           3                      none
GCT3G01               8           1                      none
GCT5D07 (D18S852)     8           2                      none
GcrsFia               6           1                      none
GcrsGOi               5                                  none
GCT7G01               6                                  none
GCT13D05              6                                  none

Chromosome 19
GCT2C12               6           1                      none
GCT4A09               7           1                      none
GCT5G03               5                                  none
GCT13A07              8           1                      none
GCT15A10              9           ?                      none

Chromosome 20
GCT10C10              14?         5                      none
GCT10F11              9           1                      none
GCT11G03              8           2                      none
GCT11G09              7           1                      none
GCT12F08              7           1                      none
GCT13B02              6                                  none
GCT13C07              8           1                      none
GCT14B04              6                                  none
GCT14G11              5                                  none

Chromosome 21
GCT2B10               6                                  none
GCT12D02              6                                  none
GCT15A04              6           1                      M34876-amyloid-beta gene (APP)

Chromosome 22
GCT6F02               7           1                      none
GCT16D01              7           1                      none

X Chromosome
GCT4C10               7           2                      none
GCT5D10               5                                  none
GcreDoe               6                                  none
GCT7D06               7           1                      none
GCT13B01              5                                  none
GCT13012              6                                  none
GCT14E12              5                                  none

No Chromosome-High Background
GCT1C07               6                                  none
GCT1D04               5           0                      X52ei1-AP-2 transcription factor
GCT1D12               6           0                      none
GCT1E10               8           0                      none
GCT2D02               5                                  none
GCT2C04               7                                  none
GCT3A10               5                                  X81699-B. taurus sodium-dependent phosphate transporter
GCT3B04               5                                  none
GCT3C10               6                                  T25820, R73200 cDNA clones CAG-isl 6, 166123
GCT3D10               6                                  none
GCT3E07               7           0                      L179B4-STS UT2ie3
GCT3H07               7           0                      [accession illegible]-dipeptidyl aminopeptidase like protein mRNA
GCT4B04               5                                  none
GCT4B08               6           0                      none
GCT5A10               6                                  none
GCT5D01               5                                  none
GCT5G04               6                                  none
GCT6O08               6                                  none
GCT10A11              6                                  none
GCT10G11              12          3                      none
GCT10H03              5                                  none
GCT10H0e              18          1                      none
GCT12B01              6                                  none
GCT12B04              6                                  none
GCT12D0e              5                                  none
GCT12G03              6                                  none
GCT13A10              5                                  none
GCT13C01              4                                  none
GCT13G09              5                                  none
GCT14A01              11          0                      none
GCT14B02              8           1                      none
GCT14C05              5                                  none
GCT14F06              12          ?                      none
GCTUHOe               5                                  none
GCT15B10              6                                  none
GCT15D03              5                                  none
GCT16D03              5                                  M2e432, M21772-human keratin type 16, keratin pseudogene
GCT16G06              6                                  none
GCT16G12              6                                  none



Note. STSs are arranged by chromosome, as assigned by somatic cell hybrid panel mapping.

(a) Name of clone and primers listed in CHLC database. ACT clones were identified in a screen for that class of repeats, but also contain a (CAG/CTG)n repeat.

(b) Number of alleles in four CEPH individuals (?, results could not be interpreted due to smears or >2 alleles/individual).

(c) An asterisk indicates that the STS has been mapped to a YAC. The data are available through the World Wide Web at http://www-genome.wi.mit.edu.
FIG. 2. Distribution of chromosomal assignments for 299
(CAG/CTG)n-containing STSs. The genetic length of a chromosome was
divided by the total genetic length of the genome, based on the most
recent CHLC maps (Murray et al., 1994). [Bar chart; x-axis: chromosome;
paired bars: relative chromosome size and STSs assigned.]

that most of the (CAG/CTG)n STSs would not be highly polymorphic. To
test this, all STSs based on a repeat length of 7 or greater were tested
for their informativeness in four reference CEPH individuals, assuming
that an STS with three alleles in the four individuals will tend to be
polymorphic in the general population. This is the same method employed
to estimate the polymorphic quality of a given STS based on an (AC)n
repeat (Weissenbach et al., 1992). This method is only a crude estimate
of informativeness, but has been useful for determining the STSs that
are most likely to be polymorphic. Additional individuals must be typed
to determine true heterozygosity frequencies. STSs based on repeats of
lengths 5 and 6 were generally not typed on the CEPH individuals for
cost and efficiency reasons. A small number of these shorter repeats
were tested, but most were monoallelic, as shown in Table 2. However,
the STSs are included in the (CAG/CTG)n survey because of the data on
myotonic dystrophy, as stated above.
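
The informativeness screen described above amounts to counting distinct alleles across the four typed individuals. In the sketch below, the genotype representation (pairs of allele sizes in base pairs), the function names, and the example sizes are hypothetical; the three-allele threshold is the one stated in the text.

```python
def count_alleles(genotypes):
    """Number of distinct alleles observed across all typed individuals."""
    return len({allele for genotype in genotypes for allele in genotype})


def likely_polymorphic(genotypes, min_alleles=3):
    """Crude screen: at least min_alleles distinct alleles among the individuals."""
    return count_alleles(genotypes) >= min_alleles


# Hypothetical genotypes (allele sizes in bp) for four individuals.
informative = [(150, 153), (150, 150), (156, 153), (150, 156)]
monoallelic = [(150, 150)] * 4

print(count_alleles(informative), likely_polymorphic(informative))
print(count_alleles(monoallelic), likely_polymorphic(monoallelic))
```

Four individuals carry at most eight alleles, so this count is only a coarse indicator; heterozygosity has to be measured in a larger sample.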

STSs Developed 

A summary of the STSs developed in this survey is shown in Table 2. A
full table including primer sequences is available at the CHLC or
Whitehead/MIT World Wide Web site (see Materials and Methods). The
chromosome assignments were obtained by somatic cell hybrid mapping for
299 (88%) of the STSs. Figure 2 shows the chromosome distribution of the
assigned STSs vs a normalized length for each chromosome. Thirty-nine
(12%) STSs could not be assigned to a specific chromosome due to high
mouse or hamster background. We have included these STSs at the end of
Table 2 because they may have primers designed in conserved regions,
suggestive of coding sequence. The sequence of each clone is available
through www.chlc.org and may allow the selection of alternate primers to
amplify these loci.

Other Loci Identified in the Screen 

Table 3 shows a list of additional loci that were detected in the
screen and identified by BLAST searches. These include loci for which
primers were developed but did not amplify; sequences where the repeat
length was less than 5; clones that were positive by hybridization, but
the sequence did not extend into the repeat region; and sequences where
primer design was impossible due to lack of flanking sequence.

DISCUSSION 

We have developed a set of STSs to be used for screening diseases
suspected to be caused by the expansion of a (CAG/CTG)n repeat. Since
seven diseases have been found to be associated with a trinucleotide
repeat of this class, it is likely that other diseases have a similar
mechanism. By utilizing marker-enriched libraries, we have generated
2400 clones that are positive for this type of repeat. This survey
includes the results of analyzing the first 1138 of these clones, and we
estimate that 800 of these were unique from each other. These and future
STSs can be obtained through the World Wide Web at www.chlc.org.

Frequency estimations indicate that this class of repeats is 50-fold
less frequent than the (CA)n repeats. We have found that the repeats
tend to be shorter in length and less polymorphic than those in other
trinucleotide repeat classes (Gastier et al., 1995). This suggests that
there may be an evolutionary restriction on the mutation rate of this
class.

The (CAG/CTG)n-containing STSs described in this paper complement the
two previous approaches that have been used to identify trinucleotide
repeats that may cause disease. Several groups have screened the
databases and cDNA libraries for genes containing trinucleotide repeats
(Riggins et al., 1992; Li et al., 1993). Based on these efforts, the
gene responsible for DRPLA and Haw River syndrome has been cloned (Koide
et al., 1994; Nagafuchi et al., 1994; Burke et al., 1994), demonstrating
the utility of a random screening approach for identifying loci
associated with trinucleotide repeats. In this paper, the sequencing of
genomic DNA instead of cDNAs allowed for primer design without the
concern for intron/exon boundaries. In addition, we may have identified
some triplet-containing gene sequences that are not expressed at levels
high enough to be detected in cDNA libraries. Since (CAG/CTG)n repeats
have been shown to be relatively rare in introns and concentrated in
coding sequence (Stallings, 1994), it is likely that many of the STSs
that we have identified are located in exon sequences.

We have identified several loci that were also detected by the groups
searching the database and cDNA libraries for (CAG/CTG)n repeats
(Riggins et al., 1992; Li et al., 1993). These include brain natriuretic
protein (GCT3H01), transcription factor AP-2 (GCT1D04), and CTG-B33
(GCT1A10). We have also detected a putative homologous locus of the
Machado-Joseph protein (GCT4A06). This clone shares 82% nucleic acid
identity near the repeat region of MJD1, but has a different repeat
motif [(GCCGCT)8 (GCT)3]. It is likely to be the
TABLE 3
Other Loci Isolated in the Screen

CHLC name     Repeat sequenced         BLAST homology

GCT1A0fi      (AGC)7                   T83552-cDNA clone 111124
GCT1B12       None                     M26434-HPRT gene
[illegible]   (CAG)3-(CAG)3            U03495-transcription factor LSF-ID mRNA
GCT3A05       None                     R28340, M94046-cDNA clone 134756, MAZ mRNA
GCT3E01                                R82424-cDNA clone 149143
GCT3F03       (CAG)3-(GAG)2-(CAG)2     M37760, X63755-high-sulfur keratin (many species)
GCT3H02       (CTG)5                   K00534, K01903-c-myc
GCT3H05       (GCT)3 CGG (GCT)3        T27013, U17280-cDNA clone LLAB132A10, StAR mRNA
GCT4A06       (GCCGCT)8 (GCT)3         MJD1 protein (homologous locus)
GCT4C04       (GCA)4                   ribosomal nontranscribed spacer
GCT5F03       (CAG)4                   X14720-c-fms proto-oncogene for CSF-1 receptor
GCT5G07       (GAG)8                   S62539-insulin-receptor substrate 1
GCT6A08       None                     R59748-cDNA clone 42349, zeta protein
GCT6B03       (GCT)7                   R06288-cDNA clone 126292
GCT6D02       (CAG)4                   M16801-mineralocorticoid receptor mRNA
GCT7A02       (GCT)6                   T84379-cDNA clone 111196
GCT7D09       (CTG)4                   X54134-HPTP epsilon mRNA
GCT7E08       None                     S47244-HB2B-high sulfur keratin B2B
GCT7G07       (AGC)4                   Z26491-catechol O-methyltransferase
GCT8A09       (CAG)6                   T67179-cDNA clone 66628
GCT8D02       (GCT)7                   U03398-receptor 4-1BB ligand mRNA
GCT8G04       None                     M23492-leukocyte common antigen T200 (CD45, LCA)
GCT8H06       (CAG)6-(CAG)7            U23862-mcag32 chromosome 7 CTG repeat region
GCT8H08       None                     F11952, M86700-cDNA clone c-33g12, phospholipase A2 mRNA
GCT9D02       None                     T25372-cDNA clone BL29-2
GCT10A03      (CTG)6                   T08157, T34263-ESTs 06048, 65013
GCT10D10      (GCT)6                   R06288-cDNA clone 126292
GCTIIEOI      (GCT)5                   R57209-cDNA clone F1503
GCT12A08      (GCT)5                   108101, 108711, M2446-patents, SFTP3
GCT12A10      (AGC)4                   U00115-zinc-finger protein bcl-3
GCT12G09      (GCT)7                   R39715-cDNA clone 136883
GCT13C12      (TGC)6                   Z15459, T39585-cDNAs 20B07, 60900
GCT13F08      None                     J05272-IMP dehydrogenase type 1 mRNA
GCT14A10      None                     M91585-peregrin mRNA
GCT15B09      (GCT)9                   X82209-MN1 mRNA
GCT15C09      (AGC)3-(AGC)4            T27046-cDNA clone LLAB212E08
GCT15H11      (CTG)4                   U25765-chromosome 17q21 mRNA clone
GCT16A10      (GCT)4                   X52560-nuclear factor NF-IL6
GCT16E12      (GCT)5                   X15357-mRNA for natriuretic peptide receptor (ANP-A)
GCT16H04      (CAG)4                   L17913-STS UT1873
GCT17F03      (TGC)7                   M33782-TFEB protein mRNA
GCT17F06      (GCT)5                   M78249-EST 00397

Note. All clones were positive by hybridization. The list includes clones for which the primers did not amplify, a repeat was not reached
by sequencing, the repeat was shorter than five perfect units, or primer design was impossible due to lack of flanking sequence.



same as one of the MJD-like sequences described in the initial MJD
discovery (Kawaguchi et al., 1994). We do not know whether this locus is
expressed, but we have determined that the locus tentatively maps to
chromosome 8 by somatic cell hybrid mapping (unpublished data).

The repeat expansion detection (RED) assay (Schalling et al., 1993) is
another technique that has been used to facilitate cloning of other
expanded loci. RED, which allows the identification of a triplet
expansion in a given individual, requires no prior knowledge of flanking
sequence. RED has allowed the detection of novel expansions of (CTG)n,
(ATG)n, (CCT)n, (CTT)n, and (TGG)n trinucleotide repeats in the genomic
DNA of normal individuals (Schalling et al., 1993; Linblad et al., 1994)
as well as the presence of (CAG/CTG)n triplet expansions in the genomic
DNA of individuals with bipolar affective disorder and schizophrenia
(Linblad et al., in press; O'Donovan et al., 1995). Targeted cloning of
long CTG triplet molecules detected by RED has proven elusive so far, in
part due to the difficulty in propagating long trinucleotide repeat
sequences in bacterial and yeast host systems. STSs generated in this
screen are candidate markers to test in individuals identified by RED to
have putative (CAG/CTG)n expansions.

These STSs will be useful in the continuing search for disease
mutations caused by a trinucleotide repeat expansion. In addition, in
cells with DNA repair defects, genes that contain (CAG/CTG)n and other
microsatellite repeats may lead to disease without large expansions. For
example, it has been shown that, in cells with a repair defect, the gene
encoding the RII subunit of the TGF-β receptor has accumulated mutations
in a polyadenine tract that may inactivate the gene (Markowitz et al.,
1995). This suggests that all loci containing microsatellite repeats are
candidates for disease-causing agents in some cancers, and the STSs
described here may help to identify some of these as well.